Is the Multigrid Method Fault Tolerant? The Multilevel Case

Authors

  • Mark Ainsworth
  • Christian Glusa
Abstract

Computing at the exascale level is expected to be affected by a significantly higher rate of faults, due to increased component counts as well as power considerations. Therefore, current-day numerical algorithms need to be reexamined to determine whether they are fault resilient, and which critical operations need to be safeguarded in order to obtain performance that is close to the ideal fault-free method. In a previous paper [1], a framework for the analysis of random stationary linear iterations was presented and applied to the two-grid method. The present work is concerned with the multigrid algorithm for the solution of linear systems of equations, which is widely used on high performance computing systems. It is shown that the Fault-Prone Multigrid Method is not resilient unless the prolongation operation is protected. Strategies for fault detection and mitigation as well as protection of the prolongation operation are presented and tested, and a guideline for an optimal choice of parameters is devised.
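
To make the setting concrete, the following is a minimal sketch of a multigrid V-cycle in which faults can be switched on and the prolongation can optionally be protected. It is not the authors' implementation. The 1D Poisson model problem, the fault model (each entry of an operation's result is lost, that is set to zero, independently with probability `q`), the fault-free coarsest solve, and all function names are assumptions chosen for illustration only.

```python
import math
import random

def residual(u, f, h):
    # r = f - A u for the 1D Poisson matrix A = tridiag(-1, 2, -1) / h^2
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r

def inject_faults(v, q, rng):
    # Assumed fault model: each entry is zeroed independently with probability q.
    return [0.0 if rng.random() < q else x for x in v]

def smooth(u, f, h, q, rng, sweeps=2, omega=2.0 / 3.0):
    # Damped Jacobi; faults hit the computed update, not the stored iterate.
    for _ in range(sweeps):
        update = [omega * 0.5 * h * h * ri for ri in residual(u, f, h)]
        update = inject_faults(update, q, rng)
        u = [ui + di for ui, di in zip(u, update)]
    return u

def restrict(r):
    # Full weighting from 2m+1 fine points to m coarse points.
    m = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(m)]

def prolong(e):
    # Linear interpolation from m coarse points to 2m+1 fine points.
    fine = [0.0] * (2 * len(e) + 1)
    for j, ej in enumerate(e):
        fine[2 * j] += 0.5 * ej
        fine[2 * j + 1] += ej
        fine[2 * j + 2] += 0.5 * ej
    return fine

def v_cycle(u, f, h, q, rng, protect_prolongation):
    if len(u) == 1:
        return [0.5 * h * h * f[0]]  # exact coarsest solve, assumed fault-free
    u = smooth(u, f, h, q, rng)
    r_c = inject_faults(restrict(residual(u, f, h)), q, rng)
    e_c = v_cycle([0.0] * len(r_c), r_c, 2.0 * h, q, rng, protect_prolongation)
    e = prolong(e_c)
    if not protect_prolongation:
        e = inject_faults(e, q, rng)  # unprotected prolongation may lose entries
    u = [ui + ei for ui, ei in zip(u, e)]
    return smooth(u, f, h, q, rng)

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def run(n=31, cycles=10, q=0.0, protect_prolongation=True, seed=0):
    # n must be of the form 2^k - 1; returns the relative residual reduction.
    rng = random.Random(seed)
    h = 1.0 / (n + 1)
    f, u = [1.0] * n, [0.0] * n
    r0 = norm(residual(u, f, h))
    for _ in range(cycles):
        u = v_cycle(u, f, h, q, rng, protect_prolongation)
    return norm(residual(u, f, h)) / r0
```

Calling `run(q=0.0)` gives the fault-free baseline, while `run(q=0.05, protect_prolongation=False)` and `run(q=0.05, protect_prolongation=True)` allow an informal comparison under this toy fault model. Results from a single seed on a small problem should not be read as confirming or refuting the paper's analysis.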

Similar resources

Asynchronous parallel solvers for linear systems arising in computational engineering

Modern trends in Computational Science and Engineering are moving towards the use of computer systems with ever increasing numbers of computational cores. A consequence of this is that over the next decade it will be necessary to develop and apply new numerical algorithms that are far more scalable than has historically been required. Ideally, such algorithms will be able to exploit many thousa...

Fault Diagnosis and Fault-Tolerant SVPWM Technique of Six-phase Converter under Open-Switch Fault

In this paper, a new open-switch fault diagnosis method is proposed for the six-phase AC-DC converter based on the difference between the phase current and the corresponding reference using an adaptive threshold. The open-switch faults are detected without any additional equipment and complicated calculations, since the proposed fault detection method is integrated with the controller required ...

Fault tolerant system with imperfect coverage, reboot and server vacation

This study is concerned with the performance modeling of a fault tolerant system consisting of operating units supported by a combination of warm and cold spares. The on-line as well as warm standby units are subject to failures and are sent for repair to a repair facility with a single repairman who is prone to failure. If the failed unit is not detected, the system enters into an unsafe...

Fault-tolerant finite-element multigrid algorithms with hierarchically compressed asynchronous checkpointing

We examine novel fault tolerance schemes for data loss in multigrid solvers which essentially combine ideas of checkpoint-restart with algorithm-based fault tolerance. To improve efficiency compared to conventional global checkpointing, we exploit the inherent data compression of the multigrid hierarchy, and relax the synchronicity requirement through a local failure local recovery approach. We...

Is the Multigrid Method Fault Tolerant? The Two-Grid Case

The predicted reduced resiliency of next-generation high performance computers means that it will become necessary to take into account the effects of randomly occurring faults on numerical methods. Further, in the event of a hard fault occurring, a decision has to be made as to what remedial action should be taken in order to resume the execution of the algorithm. The action that is chosen can...

Journal:
  • SIAM J. Scientific Computing

Volume: 39

Publication date: 2017